I have neither given nor received unauthorized assistance on this test or assignment. -Eric Warren
Here we are going to read in the mtcars data set
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
mtcars <- as_tibble(mtcars)
Here we are going to compute the correlation coefficient between mpg and all other features in the data set and then find the two features most strongly correlated with mpg.
corCars <- as_tibble(cor(mtcars$mpg, mtcars[ , c(-1)])) %>%
  gather(variable, correlation)
corCars # Print the correlation dataframe
## # A tibble: 10 × 2
## variable correlation
## <chr> <dbl>
## 1 cyl -0.852
## 2 disp -0.848
## 3 hp -0.776
## 4 drat 0.681
## 5 wt -0.868
## 6 qsec 0.419
## 7 vs 0.664
## 8 am 0.600
## 9 gear 0.480
## 10 carb -0.551
As we can see, some variables correlate with mpg more strongly than others. Here we are going to find the two most strongly correlated ones.
corCars$absCorrelation <- abs(corCars$correlation)
corCarsTop2 <- corCars %>%
  arrange(desc(absCorrelation)) %>%
  slice(1:2) %>%
  dplyr::select(-absCorrelation)
We can see that the two predictors most strongly correlated with mpg are wt and cyl, with correlations of -0.8676594 and -0.852162, respectively.
Here we are going to fit two simple linear regression models based on the top two variables with the highest correlation. Model 1 will use the strongest feature (wt) and Model 2 the second strongest feature (cyl).
carsModel1 <- lm(mpg ~ wt, mtcars)
summary(carsModel1)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
carsModel2 <- lm(mpg ~ cyl, mtcars)
summary(carsModel2)
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9814 -2.1185 0.2217 1.0717 7.5186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
## cyl -2.8758 0.3224 -8.92 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
## F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
Model 1 (which looks at the effect weight has on mpg) has an adjusted R2 value of 0.7445939, meaning about 74.46% of the variance in mpg is explained by car weight. It also has a residual standard error of 3.0458821.
Model 2 (which looks at the effect the number of cylinders has on mpg) has an adjusted R2 value of 0.7170527, meaning about 71.71% of the variance in mpg is explained by the number of cylinders. It also has a residual standard error of 3.205902.
When comparing linear regression models, we prefer the one with the higher adjusted R2 value and the lower residual standard error. Since Model 1 wins on both counts, the model using car weight is the better of the two.
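The comparison above can also be done programmatically instead of reading the two summaries by eye. A small sketch using base R (mtcars ships with R, so nothing beyond the models themselves is assumed):

```r
# Refit the two simple models and tabulate adjusted R^2 and residual SE
m1 <- lm(mpg ~ wt, mtcars)
m2 <- lm(mpg ~ cyl, mtcars)
data.frame(
  model  = c("mpg ~ wt", "mpg ~ cyl"),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  rse    = c(summary(m1)$sigma, summary(m2)$sigma)
)
# mpg ~ wt has the higher adjusted R^2 and the lower residual standard error
```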
Here we are going to fit a multiple linear regression model with all features. We are going to find which features are significant and what is the value of R2.
allCarsModel <- lm(mpg ~ ., mtcars)
summary(allCarsModel)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
In this model, we can see that only one feature is significant (and only when alpha is evaluated at 0.10): the weight of the car (wt). This variable was also the one most correlated with mpg. At an alpha level of 0.05, none of the predictor variables are significant. The adjusted R2 value is 0.8066423, meaning about 80.66% of the variance in mpg is explained by the predictors we are given.
Here we are going to identify the best subset of features by fitting a multiple linear regression model using stepwise selection. We will write down the regression formula and R2 for this model, check whether any of the features from (a) are included, and see whether their coefficients match the ones from Model 1 or Model 2 in (b); if the coefficient values have changed, we will explain why.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
noCarsModel <- lm(mpg ~ 1, mtcars)
modelTesting <- stepAIC(noCarsModel,
scope = list(lower = noCarsModel,
upper = allCarsModel),
direction = "both")
## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.73 278.32 73.217
## + cyl 1 817.71 308.33 76.494
## + disp 1 808.89 317.16 77.397
## + hp 1 678.37 447.67 88.427
## + drat 1 522.48 603.57 97.988
## + vs 1 496.53 629.52 99.335
## + am 1 405.15 720.90 103.672
## + carb 1 341.78 784.27 106.369
## + gear 1 259.75 866.30 109.552
## + qsec 1 197.39 928.66 111.776
## <none> 1126.05 115.943
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + cyl 1 87.15 191.17 63.198
## + hp 1 83.27 195.05 63.840
## + qsec 1 82.86 195.46 63.908
## + vs 1 54.23 224.09 68.283
## + carb 1 44.60 233.72 69.628
## + disp 1 31.64 246.68 71.356
## <none> 278.32 73.217
## + drat 1 9.08 269.24 74.156
## + gear 1 1.14 277.19 75.086
## + am 1 0.00 278.32 75.217
## - wt 1 847.73 1126.05 115.943
##
## Step: AIC=63.2
## mpg ~ wt + cyl
##
## Df Sum of Sq RSS AIC
## + hp 1 14.551 176.62 62.665
## + carb 1 13.772 177.40 62.805
## <none> 191.17 63.198
## + qsec 1 10.567 180.60 63.378
## + gear 1 3.028 188.14 64.687
## + disp 1 2.680 188.49 64.746
## + vs 1 0.706 190.47 65.080
## + am 1 0.125 191.05 65.177
## + drat 1 0.001 191.17 65.198
## - cyl 1 87.150 278.32 73.217
## - wt 1 117.162 308.33 76.494
##
## Step: AIC=62.66
## mpg ~ wt + cyl + hp
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## - hp 1 14.551 191.17 63.198
## + am 1 6.623 170.00 63.442
## + disp 1 6.176 170.44 63.526
## - cyl 1 18.427 195.05 63.840
## + carb 1 2.519 174.10 64.205
## + drat 1 2.245 174.38 64.255
## + qsec 1 1.401 175.22 64.410
## + gear 1 0.856 175.76 64.509
## + vs 1 0.060 176.56 64.654
## - wt 1 115.354 291.98 76.750
bestCarsModel <- lm(formula(modelTesting), mtcars)
summary(bestCarsModel)
##
## Call:
## lm(formula = formula(modelTesting), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## hp -0.01804 0.01188 -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
The equation for the best model for predicting mpg is: estimated mpg = 38.7517874 - 3.1669731 * wt - 0.9416168 * cyl - 0.0180381 * hp. The adjusted R2 value is 0.8263446, meaning about 82.63% of the variance in mpg is explained by the predictors we are given, which are wt, cyl, and hp.
As we can see, the coefficient values for wt and cyl differ between the two types of models. When modeled individually, each predictor has a coefficient that is larger in magnitude than when it appears in the multiple linear regression model. The table below shows how the wt and cyl coefficients change between the simple and multiple linear models.
`Variable Names` <- c("wt", "cyl")
`Simple Linear Coefficient Values` <- c(carsModel1$coefficients[2], carsModel2$coefficients[2])
`Multiple Linear Coefficient Values` <- bestCarsModel$coefficients[c(2, 3)]
coefficientTable <- as_tibble(cbind(`Variable Names`, `Simple Linear Coefficient Values`, `Multiple Linear Coefficient Values`))
knitr::kable(coefficientTable, "pipe")
| Variable Names | Simple Linear Coefficient Values | Multiple Linear Coefficient Values |
|---|---|---|
| wt | -5.34447157272267 | -3.16697311074858 |
| cyl | -2.87579013906447 | -0.941616811990739 |
The table emphasizes that the coefficient values have become much smaller in magnitude. This is because wt, cyl, and hp are correlated with one another, so in a simple regression each coefficient absorbs part of the effect of the omitted variables. Including the correlated predictors together reduces this omitted-variable bias and moves each coefficient closer to its true partial effect, which is why multiple linear regression tends to produce coefficients that are smaller in magnitude than simple linear regression.
We are going to use a model that looks like this:
\[ \text{Predicted Salary} = 50 + 20 \cdot \text{GPA} + 0.07 \cdot \text{IQ} + 35 \cdot \text{Education Level} + 0.01 \cdot (\text{GPA} \times \text{IQ}) - 10 \cdot (\text{GPA} \times \text{Education Level}) \]
If we know that IQ and GPA are fixed then we can rewrite a regression equation to only evaluate the difference on projected salary based on education level. So now it is:
\[ \text{Predicted Difference in Salary} = 35 \cdot \text{Education Level} - 10 \cdot (\text{GPA} \times \text{Education Level}) \]
Furthermore, we can look at both the high school and college salaries individually if the GPA and IQ are fixed.
\[ \text{Projected Difference, College Salary} = 35 - 10 \cdot \text{GPA} \] \[ \text{Projected Difference, High School Salary} = 0 \]
So thus, we can see that by setting the equations to equal to each other to see who might make more:
\[ 35 - 10 \cdot \text{GPA} = 0 \] or \[ \text{GPA} = 3.5 \]
From this, we can see that if GPA is higher than a 3.5, then the high school graduate will make more money, on average, and if less than 3.5 GPA then the college graduate makes more, on average. So based on this, we can see that answer choice (iii) is correct in saying: “For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates provided that the GPA is high enough.”
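The crossover can be checked with a quick sketch in R; the function below simply encodes the salary model from the problem (with education level coded 1 for college, 0 for high school, as in the problem statement):

```r
# Salary model from the problem (salary in thousands of dollars)
predicted_salary <- function(gpa, iq, level) {  # level: 1 = college, 0 = high school
  50 + 20 * gpa + 0.07 * iq + 35 * level + 0.01 * (gpa * iq) - 10 * (gpa * level)
}
# The college-minus-high-school difference does not depend on IQ:
diff_at <- function(gpa) predicted_salary(gpa, 100, 1) - predicted_salary(gpa, 100, 0)
diff_at(3.5)  # 0: both groups are predicted to earn the same at GPA = 3.5
diff_at(4.0)  # -5: above a 3.5 GPA, high school graduates are predicted to earn more
```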
Here we are going to predict the salary of a college graduate with IQ of 110 and a GPA of 4.0.
50 + 20 * 4.0 + 0.07 * 110 + 35 * 1 + 0.01 * (4.0 * 110) - 10 * (4.0 * 1) # Note this is in thousands of dollars
## [1] 137.1
We can see, by plugging the given values into the regression equation, that a college graduate with a 4.0 GPA and an IQ of 110 is predicted to make, on average, $137,100.
Question: True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
Answer: This statement is false. The magnitude of a coefficient by itself is not evidence for or against an effect, since it depends on the scale of the variables (GPA * IQ takes on large values, so even a small coefficient can contribute meaningfully). We would need the coefficient's standard error and p-value to determine whether an interaction effect is present.
Here we are going to read in and use the Auto data.
library(ISLR)
Auto <- as_tibble(Auto)
Here we are going to use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results.
mpg2hp <- lm(mpg ~ horsepower, Auto)
summary(mpg2hp)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
We can see that there is a relationship between mpg and horsepower, since the p-value is about \(7.03 \times 10^{-81}\), which is far below any alpha level we could use. Thus, there is statistically significant evidence that the coefficient is not 0.
We can see that the correlation between mpg and horsepower is -0.7784268. This is a fairly strong, negative relationship.
Since the coefficient of horsepower in the model is negative and the correlation is also negative, we can conclude that the relationship between horsepower and mpg is negative: as horsepower increases, predicted mpg decreases.
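The correlation quoted above can be computed directly; a short sketch (assumes the ISLR package, which supplies the Auto data, is installed):

```r
library(ISLR)  # provides the Auto data set
r <- cor(Auto$mpg, Auto$horsepower)
r    # about -0.778: a fairly strong negative relationship
r^2  # about 0.606, matching the simple regression's multiple R-squared
```

The second line illustrates why: in simple linear regression, R-squared is exactly the squared correlation between the response and the predictor.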
Here we can see what predicted values of mpg are if horsepower is 98.
fittedValue <- mpg2hp$coefficients[[1]] + mpg2hp$coefficients[[2]] * 98 # Predicted mpg
confidenceValue <- predict(mpg2hp,
tibble(horsepower = 98),
level = 0.95,
interval = "confidence") # 95% confidence level
predictionValues <- predict(mpg2hp,
tibble(horsepower = 98),
level = 0.95,
interval = "prediction") # 95% prediction level
Here we can see that the fitted value we would expect for a 98-horsepower car is 24.47 mpg. The 95% confidence interval says that, on average, cars with 98 horsepower get between 23.97 and 24.96 mpg. If we had to predict the mpg of a single car with 98 horsepower, the 95% prediction interval says we would expect between 14.81 and 34.12 mpg.
Here we are going to plot the relationship between horsepower and mpg.
Auto %>%
ggplot(aes(x = horsepower,
y = mpg)) +
geom_point(alpha = 0.3,
color = "blue") +
geom_smooth(method = "lm",
se = FALSE,
color = "red") +
labs(title = "Relationship Between Horsepower and MPG",
caption = "Eric Warren",
x = "Horsepower",
y = "MPG") +
theme_bw() +
theme(plot.title = element_text(color = "blue",
face = "bold",
size = 16,
hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'
From this plot, the relationship looks like exponential decay rather than a straight line; we will examine this in the next part.
We are going to plot the diagnostics to see if this passes all the assumptions of linear regression.
par(mfrow = c(2, 2))
plot(mpg2hp,
main = "Diagnostics Plots for the Horsepower and MPG Relationship")
The residuals are not homoskedastic: there is a clear pattern in the residuals-vs-fitted plot, which violates the linearity and constant-variance assumptions, and the Q-Q plot suggests the normality assumption is questionable as well. This supports my point in part B that the relationship is not well represented by a straight line.
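The visual impression of non-constant variance can be backed up with a formal test. A sketch using the Breusch-Pagan test (this assumes the lmtest and ISLR packages are installed; lmtest is not loaded elsewhere in this document):

```r
library(ISLR)    # Auto data
library(lmtest)  # bptest()
fit <- lm(mpg ~ horsepower, Auto)
bptest(fit)  # a small p-value rejects the hypothesis of constant error variance
```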
Here we are going to make a scatterplot matrix of all the variables in the data set.
pairs(Auto,
main = "Scatterplot Matrix between all the Variables in Auto Data")
Here we are going to get the correlations of all the variables of the Auto data set.
cor(Auto[ , unlist(lapply(Auto, is.numeric))])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
We can see that weight, horsepower, displacement, and cylinders all have a strong relationship with mpg.
Here we are going to make a multiple linear regression model of all the numeric predictor variables.
mpg2numericAll <- lm(mpg ~ . - name, Auto)
summary(mpg2numericAll)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
There is a relationship between the predictors and the response, as we could have guessed: the F-statistic's p-value is far below 0.05, and several coefficients are significantly different from zero.
The variables that have a statistically significant relationship at an alpha level of 0.05 are displacement, weight, year, and origin.
The coefficient of year tells us that for each additional model year, predicted mpg increases by about 0.75, holding the other predictors constant.
We are going to make a diagnostics plot of the multiple linear regression model.
par(mfrow = c(2, 2))
plot(mpg2numericAll,
     main = "Diagnostics Plots for MPG's Relationship \nOn Numeric Variables")
It seems that the residuals might have a pattern with a quadratic curve. The residuals show that there are some outliers in the newer cars. There are also some cars with some abnormally high leverage, like observation 116.
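High-leverage points can also be identified numerically rather than by eye, via the hat values of the fitted model. A sketch (assumes the ISLR package for the Auto data):

```r
library(ISLR)  # Auto data
fit <- lm(mpg ~ . - name, Auto)
lev <- hatvalues(fit)                # each observation's leverage
which.max(lev)                       # index of the single highest-leverage point
head(sort(lev, decreasing = TRUE))   # the largest leverages; compare to 2 * p / n
```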
Here we are going to make a new multiple linear regression model with all pairwise interaction terms among the numeric predictors.
mpg2numericAllWithInteraction <- lm(mpg ~ (. - name) ^ 2, Auto)
summary(mpg2numericAllWithInteraction)
##
## Call:
## lm(formula = mpg ~ (. - name)^2, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
We can see that, using an alpha level of 0.05, some interaction terms are statistically significant in the model. The significant interaction terms are displacement:year, acceleration:year, and acceleration:origin.
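Rather than scanning the summary table by hand, the significant terms can be pulled out programmatically. A sketch (assumes the ISLR package for the Auto data):

```r
library(ISLR)  # Auto data
fit <- lm(mpg ~ (. - name)^2, Auto)
ct <- summary(fit)$coefficients
rownames(ct)[ct[, "Pr(>|t|)"] < 0.05]
# includes displacement:year, acceleration:year, and acceleration:origin
```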
Here we are going to try some different transformations to see how they look. Note these are without the interaction terms.
Here we are going to examine how taking the log of the predictors look.
AutoLog <- Auto %>%
mutate_at(2:8, list(log = ~ log(.))) %>%
dplyr::select(mpg,
name,
contains("log"))
mpg2numericAllLog <- lm(mpg ~ . - name, AutoLog)
summary(mpg2numericAllLog)
##
## Call:
## lm(formula = mpg ~ . - name, data = AutoLog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5987 -1.8172 -0.0181 1.5906 12.8132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.5643 17.5053 -3.803 0.000167 ***
## cylinders_log 1.4818 1.6589 0.893 0.372273
## displacement_log -1.0551 1.5385 -0.686 0.493230
## horsepower_log -6.9657 1.5569 -4.474 1.01e-05 ***
## weight_log -12.5728 2.2251 -5.650 3.12e-08 ***
## acceleration_log -4.9831 1.6078 -3.099 0.002082 **
## year_log 54.9857 3.5555 15.465 < 2e-16 ***
## origin_log 1.5822 0.5083 3.113 0.001991 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.069 on 384 degrees of freedom
## Multiple R-squared: 0.8482, Adjusted R-squared: 0.8454
## F-statistic: 306.5 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mpg2numericAllLog,
main = "Diagnostics Plots for MPG's Relationship \nOn Numeric Variables in Log Form")
Doing the log transformations on the predictor variables tends to look better in terms of normality. It also seems to pass most of the other assumptions of linear regression and might be a better model to use when predicting the mpg of a car. Also note that taking the log of cylinders and log of displacement shows these variables aren’t the best at predicting mpg.
Here we are going to examine how taking the squared values of the predictors look.
AutoSquared <- Auto %>%
mutate_at(2:8, list(squared = ~ (.)**2)) %>%
dplyr::select(mpg,
name,
contains("squared"))
mpg2numericAllSquared <- lm(mpg ~ . - name, AutoSquared)
summary(mpg2numericAllSquared)
##
## Call:
## lm(formula = mpg ~ . - name, data = AutoSquared)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6786 -2.3227 -0.0582 1.9073 12.9807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.208e+00 2.356e+00 0.513 0.608382
## cylinders_squared -8.829e-02 2.521e-02 -3.502 0.000515 ***
## displacement_squared 5.680e-05 1.382e-05 4.109 4.87e-05 ***
## horsepower_squared -3.621e-05 4.975e-05 -0.728 0.467201
## weight_squared -9.351e-07 8.978e-08 -10.416 < 2e-16 ***
## acceleration_squared 6.278e-03 2.690e-03 2.334 0.020130 *
## year_squared 4.999e-03 3.530e-04 14.160 < 2e-16 ***
## origin_squared 4.129e-01 6.914e-02 5.971 5.37e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.539 on 384 degrees of freedom
## Multiple R-squared: 0.7981, Adjusted R-squared: 0.7944
## F-statistic: 216.8 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mpg2numericAllSquared,
main = "Diagnostics Plots for MPG's Relationship \nOn Numeric Variables in Squared Form")
It looks like the normality assumption for linear regression is violated. Due to this, squaring the predictor variables does not seem to help.
Here we are going to examine how taking the square root values of the predictors look.
AutoSquareRoot <- Auto %>%
mutate_at(2:8, list(squareRoot = ~ (.)**0.5)) %>%
dplyr::select(mpg,
name,
contains("squareRoot"))
mpg2numericAllSquareRoot <- lm(mpg ~ . - name, AutoSquareRoot)
summary(mpg2numericAllSquareRoot)
##
## Call:
## lm(formula = mpg ~ . - name, data = AutoSquareRoot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5250 -1.9822 -0.1111 1.7347 13.0681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49.79814 9.17832 -5.426 1.02e-07 ***
## cylinders_squareRoot -0.23699 1.53753 -0.154 0.8776
## displacement_squareRoot 0.22580 0.22940 0.984 0.3256
## horsepower_squareRoot -0.77976 0.30788 -2.533 0.0117 *
## weight_squareRoot -0.62172 0.07898 -7.872 3.59e-14 ***
## acceleration_squareRoot -0.82529 0.83443 -0.989 0.3233
## year_squareRoot 12.79030 0.85891 14.891 < 2e-16 ***
## origin_squareRoot 3.26036 0.76767 4.247 2.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 384 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8308
## F-statistic: 275.3 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mpg2numericAllSquareRoot,
main = "Diagnostics Plots for MPG's Relationship \nOn Numeric Variables in Square Root Form")
This does not look bad in terms of normality, but there still appear to be some patterns in the residuals. This might not be a bad transformation to use, but I would prefer the log transformation.
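The three transformations can also be compared head to head by adjusted R-squared. A sketch that refits each variant in one pass (assumes the ISLR package for the Auto data; `fit_with` is a helper defined here, not part of any library):

```r
library(ISLR)  # Auto data
num <- Auto[, sapply(Auto, is.numeric)]  # keep mpg plus the numeric predictors
fit_with <- function(tf) {
  d <- num
  d[-1] <- lapply(d[-1], tf)             # transform the predictors, keep mpg as-is
  summary(lm(mpg ~ ., d))$adj.r.squared
}
round(c(identity = fit_with(identity),
        log      = fit_with(log),
        squared  = fit_with(function(x) x^2),
        sqrt     = fit_with(sqrt)), 4)
# the log transformation gives the highest adjusted R^2 of the four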
Here we are going to read in the Carseats data.
Carseats <- as_tibble(Carseats)
Here I am going to fit a multiple linear regression model using Price, Urban, and US to predict Sales.
carSalesModel <- lm(Sales ~ Price + Urban + US, Carseats)
Here we are going to interpret our model.
summary(carSalesModel)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Looking at the (Intercept): when Price is 0 and the store is neither urban nor in the US, we estimate sales of 13.0434689 thousand car seats, or about 13,043 in total.
Looking at Price: holding everything else constant, each one-dollar increase in price changes predicted sales by -0.0544588 thousand car seats, a decrease of about 54 units.
Looking at UrbanYes: holding everything else constant, a store in an urban area is predicted to sell 0.0219162 thousand fewer car seats, or about 22 fewer, than one that is not.
Looking at USYes: holding everything else constant, a store in the US is predicted to sell 1.2005727 thousand more car seats, or about 1,201 more, than one that is not.
The equation of this regression model is: predicted sales of car seats (in thousands) = 13.0434689 - 0.0544588 * Price - 0.0219162 * Urban + 1.2005727 * US.
Note: The Urban and US variables are coded 0 for No and 1 for Yes, since they are qualitative.
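The 0/1 coding can be seen directly in the design matrix R builds for the model. A sketch (assumes the ISLR package for the Carseats data):

```r
library(ISLR)  # Carseats data
head(model.matrix(~ Price + Urban + US, Carseats))
# the UrbanYes and USYes columns are 1 for "Yes" and 0 for "No"
```

Each two-level factor contributes a single dummy column, which is exactly why the summary reports coefficients named UrbanYes and USYes.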
Looking back at our summary table in Part B, we can see that the predictor variables with a statistically significant effect on the model are Price and US, since their p-values are less than 0.0001.
Now we are going to fit a new model without the Urban predictor, since it was not statistically significant in helping our prediction.
carSalesModelUpdated <- lm(Sales ~ Price + US, Carseats)
We are going to look at the summaries of the models to see how well both of them are being fitted.
summary(carSalesModel)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
summary(carSalesModelUpdated)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
As we can see, the adjusted R2 for the model with all 3 predictors is 0.2335123, meaning about 23.35% of the variance in Sales is explained by the predictors we are given. For the model that uses only US and Price as predictors, the adjusted R2 is 0.2354305, or about 23.54%. Explaining such a low percentage of the variance is not good, so both models fit the data quite poorly in my opinion.
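Since the second model is nested inside the first, the two can also be compared with a partial F-test rather than by eyeballing adjusted R-squared. A sketch (assumes the ISLR package for the Carseats data):

```r
library(ISLR)  # Carseats data
full    <- lm(Sales ~ Price + Urban + US, Carseats)
reduced <- lm(Sales ~ Price + US, Carseats)
anova(reduced, full)  # a large p-value means dropping Urban loses essentially nothing
```

For a single dropped predictor, this F-test is equivalent to the t-test on UrbanYes in the full model's summary.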
Here we are going to obtain 95% confidence intervals for the coefficients of each predictor.
confidencePrice <- confint(carSalesModelUpdated,
"Price",
level = 0.95)
confidenceUS <- confint(carSalesModelUpdated,
"USYes",
level = 0.95)
Looking at price, we are 95% confident that the price coefficient is between -0.0647598 and -0.0441954; that is, holding everything else constant, each one-dollar price increase loses between about 44 and 65 car seat sales.
Looking at whether the store is in the US, we are 95% confident that the US coefficient is between 0.6915196 and 1.7077663; that is, holding everything else constant, a store in the US gains between about 692 and 1,708 car seat sales compared to one that is not.
Here we are going to examine if there are any outliers or evidence of high leverage observations.
par(mfrow = c(2, 2))
plot(carSalesModelUpdated,
main = "Diagnostics Plots for Car Seats Sold and \nRelationship to Price and Store Location")
From this, we can see that observations 69 and 377 appear to be consistent outliers, and the high-leverage observations are 26, 50, and 368. Everything else seems to look good, and the model appears to meet all the assumptions.
Here we are going to read in the Boston data set.
Boston <- as_tibble(Boston)
There are many simple linear regression models we are going to have to make.
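Before walking through them one by one, here is a sketch that fits every simple regression of crim on a single predictor in one loop (assumes the MASS package for the Boston data; `fits` is a helper list built here, not part of any library):

```r
library(MASS)  # Boston data
predictors <- setdiff(names(Boston), "crim")
fits <- lapply(predictors,
               function(p) lm(reformulate(p, response = "crim"), data = Boston))
names(fits) <- predictors
# slope and p-value from each simple regression of crim on one predictor:
t(sapply(fits, function(f) summary(f)$coefficients[2, c("Estimate", "Pr(>|t|)")]))
```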
First, look at how crime is affected by residential zones.
crimeOnZone <- lm(crim ~ zn, Boston)
summary(crimeOnZone)
##
## Call:
## lm(formula = crim ~ zn, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
par(mfrow = c(2, 2))
plot(crimeOnZone,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Zone")
Here we can see that for every additional percentage of residential land zoned for large lots, the predicted per capita crime rate goes down by about 0.0739. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by non-retail business.
crimeOnIndus <- lm(crim ~ indus, Boston)
summary(crimeOnIndus)
##
## Call:
## lm(formula = crim ~ indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.972 -2.698 -0.736 0.712 81.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.06374 0.66723 -3.093 0.00209 **
## indus 0.50978 0.05102 9.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1637
## F-statistic: 99.82 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnIndus,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Industrial Zone")
Here we can see that for every additional percentage of land devoted to non-retail business, the predicted per capita crime rate goes up by about 0.5098. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the river.
crimeOnChas <- lm(crim ~ chas, Boston)
summary(crimeOnChas)
##
## Call:
## lm(formula = crim ~ chas, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## chas -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
par(mfrow = c(2, 2))
plot(crimeOnChas,
main = "Diagnostics Plots for Per Capita \nCrime Rate and River Bank")
Here we can see that if the tract bounds the river, the predicted per capita crime rate is lower by about 1.8928. However, this coefficient is not statistically significant (p = 0.209), so there is no real evidence that this variable has an effect on crime rate. It also seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the nitrogen oxides.
crimeOnNox <- lm(crim ~ nox, Boston)
summary(crimeOnNox)
##
## Call:
## lm(formula = crim ~ nox, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.371 -2.738 -0.974 0.559 81.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.720 1.699 -8.073 5.08e-15 ***
## nox 31.249 2.999 10.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1756
## F-statistic: 108.6 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnNox,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Nitrogen Oxide")
Here we can see that for every one-part increase in nitrogen oxide concentration per 10 million parts, the predicted per capita crime rate goes up by about 31.2485. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the number of rooms per dwelling.
crimeOnRm <- lm(crim ~ rm, Boston)
summary(crimeOnRm)
##
## Call:
## lm(formula = crim ~ rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -3.952 -2.654 0.989 87.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.482 3.365 6.088 2.27e-09 ***
## rm -2.684 0.532 -5.045 6.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.04618
## F-statistic: 25.45 on 1 and 504 DF, p-value: 6.347e-07
par(mfrow = c(2, 2))
plot(crimeOnRm,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Number of Rooms")
Here we can see that for every additional room in a dwelling, the predicted per capita crime rate goes down by about 2.6841. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the proportion of units built prior to 1940.
crimeOnAge <- lm(crim ~ age, Boston)
summary(crimeOnAge)
##
## Call:
## lm(formula = crim ~ age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.789 -4.257 -1.230 1.527 82.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.77791 0.94398 -4.002 7.22e-05 ***
## age 0.10779 0.01274 8.463 2.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1227
## F-statistic: 71.62 on 1 and 504 DF, p-value: 2.855e-16
par(mfrow = c(2, 2))
plot(crimeOnAge,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Building Age")
Here we can see that for every additional percentage of houses built before 1940, the predicted per capita crime rate goes up by about 0.1078. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the weighted mean distance from Boston centers.
crimeOnDis <- lm(crim ~ dis, Boston)
summary(crimeOnDis)
##
## Call:
## lm(formula = crim ~ dis, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.708 -4.134 -1.527 1.516 81.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4993 0.7304 13.006 <2e-16 ***
## dis -1.5509 0.1683 -9.213 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.1425
## F-statistic: 84.89 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnDis,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Mean Distance from Centers")
Here we can see that for every additional unit of weighted mean distance from the Boston employment centers, the predicted per capita crime rate goes down by about 1.5509. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the accessibility to radial highways.
crimeOnRad <- lm(crim ~ rad, Boston)
summary(crimeOnRad)
##
## Call:
## lm(formula = crim ~ rad, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.164 -1.381 -0.141 0.660 76.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.28716 0.44348 -5.157 3.61e-07 ***
## rad 0.61791 0.03433 17.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared: 0.3913, Adjusted R-squared: 0.39
## F-statistic: 323.9 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnRad,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Index of Accessibility to Radial Highways")
Here we can see that for every additional index point of accessibility to radial highways, the predicted per capita crime rate goes up by about 0.6179. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the property tax rate.
crimeOnTax <- lm(crim ~ tax, Boston)
summary(crimeOnTax)
##
## Call:
## lm(formula = crim ~ tax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.513 -2.738 -0.194 1.065 77.696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.528369 0.815809 -10.45 <2e-16 ***
## tax 0.029742 0.001847 16.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared: 0.3396, Adjusted R-squared: 0.3383
## F-statistic: 259.2 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnTax,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Tax Rate")
Here we can see that for every additional dollar in property tax rate per 10,000 dollars, the predicted per capita crime rate goes up by about 0.0297. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by pupil-teacher ratio by town.
crimeOnPtratio <- lm(crim ~ ptratio, Boston)
summary(crimeOnPtratio)
##
## Call:
## lm(formula = crim ~ ptratio, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.654 -3.985 -1.912 1.825 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.6469 3.1473 -5.607 3.40e-08 ***
## ptratio 1.1520 0.1694 6.801 2.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared: 0.08407, Adjusted R-squared: 0.08225
## F-statistic: 46.26 on 1 and 504 DF, p-value: 2.943e-11
par(mfrow = c(2, 2))
plot(crimeOnPtratio,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Pupil to Teacher Ratio")
Here we can see that for every additional student per teacher, the predicted per capita crime rate goes up by about 1.152. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the proportion of African-Americans.
crimeOnBlack <- lm(crim ~ black, Boston)
summary(crimeOnBlack)
##
## Call:
## lm(formula = crim ~ black, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnBlack,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Proportion of African-Americans")
Here we can see that for every additional point of the African-American proportion index, the predicted per capita crime rate goes down by about 0.0363. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Next, look at how crime is affected by the percentage of the lower status of population.
crimeOnLstat <- lm(crim ~ lstat, Boston)
summary(crimeOnLstat)
##
## Call:
## lm(formula = crim ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.925 -2.822 -0.664 1.079 82.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.33054 0.69376 -4.801 2.09e-06 ***
## lstat 0.54880 0.04776 11.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.206
## F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnLstat,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Percentage of Lower Status")
Here we can see that for every additional percentage of the population that is lower status in an area, the predicted per capita crime rate goes up by about 0.5488. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
Lastly, look at how crime is affected by the median value of owned homes in thousands of dollars.
crimeOnMedv <- lm(crim ~ medv, Boston)
summary(crimeOnMedv)
##
## Call:
## lm(formula = crim ~ medv, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.071 -4.022 -2.343 1.298 80.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.79654 0.93419 12.63 <2e-16 ***
## medv -0.36316 0.03839 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnMedv,
main = "Diagnostics Plots for Per Capita \nCrime Rate and Median House Price of Owned Homes")
Here we can see that for every additional thousand dollars the median value of owned homes goes up, the predicted per capita crime rate goes down by about 0.3632. This variable has an effect on crime rate. It seems that the normality assumption is violated and that doing linear regression on this variable is not a good idea.
As we can see, every predictor variable except chas has a statistically significant effect on the per capita crime rate when used on its own. However, none of these variables meet all of the assumptions needed for linear regression modeling, especially the normality assumption.
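The thirteen simple regressions above can also be fit in a single loop; a sketch assuming the Boston data comes from the MASS package, with reformulate building each one-predictor formula:

```r
library(MASS)  # provides the Boston data set

# Every column except the response
predictors <- setdiff(names(Boston), "crim")

# Fit crim ~ x for each predictor and pull the slope's p-value
slopePvals <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "crim"), data = Boston)
  summary(fit)$coefficients[2, 4]  # Pr(>|t|) of the slope
})

# chas is the only predictor whose slope is not significant at alpha = 0.05
round(slopePvals, 4)
```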
Here we are going to fit a multiple linear regression model on all the predictor variables to see how this combined model looks.
crimeOnAllModel <- lm(crim ~ ., Boston)
summary(crimeOnAllModel)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(crimeOnAllModel,
main = "Diagnostics Plots for Per Capita \nCrime Rate on All Predictor Variables")
The variables with statistically significant evidence of an effect in this model, at an alpha level of 0.05, are zn, dis, rad, black, and medv. We can see that many of the variables are not needed for prediction, and the adjusted R2 value is pretty low, which shows that a lot of the variability in the crime rate is left unexplained. The residuals tend to follow normality alright, minus some outliers, so we might have to proceed with caution when using this model for prediction, because it does not look entirely normal (an assumption needed to use this model).
Here we are going to compare the predictor coefficients when each predictor is fit by itself versus when all are fit together in one model.
`Variable Names` <- c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv")
`Coefficient Values on Simple Model` <- as.numeric(
c(round(crimeOnZone$coefficients[[2]], 4),
round(crimeOnIndus$coefficients[[2]], 4),
round(crimeOnChas$coefficients[[2]], 4),
round(crimeOnNox$coefficients[[2]], 4),
round(crimeOnRm$coefficients[[2]], 4),
round(crimeOnAge$coefficients[[2]], 4),
round(crimeOnDis$coefficients[[2]], 4),
round(crimeOnRad$coefficients[[2]], 4),
round(crimeOnTax$coefficients[[2]], 4),
round(crimeOnPtratio$coefficients[[2]], 4),
round(crimeOnBlack$coefficients[[2]], 4),
round(crimeOnLstat$coefficients[[2]], 4),
round(crimeOnMedv$coefficients[[2]], 4)
)
)
`Coefficient Values on Multiple Model` <- as.numeric(unname(round(crimeOnAllModel$coefficients[c(2:14)], 4)))
coefficientTableCrime <- as_tibble(cbind(`Variable Names`, `Coefficient Values on Simple Model`,`Coefficient Values on Multiple Model`))
coefficientTableCrime$`Coefficient Values on Simple Model` <- as.numeric(coefficientTableCrime$`Coefficient Values on Simple Model`)
coefficientTableCrime$`Coefficient Values on Multiple Model` <- as.numeric(coefficientTableCrime$`Coefficient Values on Multiple Model`)
# Make table
knitr::kable(coefficientTableCrime, "pipe")
| Variable Names | Coefficient Values on Simple Model | Coefficient Values on Multiple Model |
|---|---|---|
| zn | -0.0739 | 0.0449 |
| indus | 0.5098 | -0.0639 |
| chas | -1.8928 | -0.7491 |
| nox | 31.2485 | -10.3135 |
| rm | -2.6841 | 0.4301 |
| age | 0.1078 | 0.0015 |
| dis | -1.5509 | -0.9872 |
| rad | 0.6179 | 0.5882 |
| tax | 0.0297 | -0.0038 |
| ptratio | 1.1520 | -0.2711 |
| black | -0.0363 | -0.0075 |
| lstat | 0.5488 | 0.1262 |
| medv | -0.3632 | -0.1989 |
# Make plot with comparison
coefficientTableCrime %>%
ggplot(aes(x = `Coefficient Values on Simple Model`,
y = `Coefficient Values on Multiple Model`,
color = `Variable Names`)) +
geom_point(alpha = 0.3) +
labs(title = "Comparing Coefficient Values on Simple \nand Multiple Linear Regression Models",
caption = "Eric Warren",
x = "Simple Linear Regression Model Coefficient Values",
y = "Multiple Linear Regression Model Coefficient Values",
color = "Variable Names") +
theme_bw() +
theme(plot.title = element_text(color = "blue",
face = "bold",
size = 16,
hjust = 0.5))
We can see that the absolute values of the predictor coefficients in the multiple linear regression model are all less than the absolute values of the corresponding coefficients in the simple linear regression models. I found it interesting that the sign flipped for zn, indus, nox, rm, tax, and ptratio; all of these variables except zn did not have a significant effect in the multiple linear regression model. The exception, zn, has a coefficient so close to zero that this might have caused the flip. But the major takeaway is that the effect each predictor variable has on the per capita crime rate is much smaller in a multiple linear regression model than by itself in a simple linear regression model.
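The sign flips can be checked programmatically rather than read off the table; a sketch assuming the Boston data comes from the MASS package:

```r
library(MASS)  # provides the Boston data set

predictors <- setdiff(names(Boston), "crim")

# Slope from each simple regression crim ~ x
simpleCoefs <- sapply(predictors, function(v) {
  unname(coef(lm(reformulate(v, response = "crim"), data = Boston))[2])
})

# Slopes from the full multiple regression, in the same order
multipleCoefs <- coef(lm(crim ~ ., data = Boston))[predictors]

# Predictors whose coefficient changes sign between the two fits
predictors[sign(simpleCoefs) != sign(multipleCoefs)]
```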
Here we are going to see if there is non-linear association between any of the variables.
First, look at how crime is affected by residential zones.
crimeOnZoneCubed <- lm(crim ~ poly(zn, 3), Boston)
summary(crimeOnZoneCubed)
##
## Call:
## lm(formula = crim ~ poly(zn, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.821 -4.614 -1.294 0.473 84.130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3722 9.709 < 2e-16 ***
## poly(zn, 3)1 -38.7498 8.3722 -4.628 4.7e-06 ***
## poly(zn, 3)2 23.9398 8.3722 2.859 0.00442 **
## poly(zn, 3)3 -10.0719 8.3722 -1.203 0.22954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared: 0.05824, Adjusted R-squared: 0.05261
## F-statistic: 10.35 on 3 and 502 DF, p-value: 1.281e-06
We can see that for the residential zones variable the squared term is statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by non-retail business.
crimeOnIndusCubed <- lm(crim ~ poly(indus, 3), Boston)
summary(crimeOnIndusCubed)
##
## Call:
## lm(formula = crim ~ poly(indus, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.278 -2.514 0.054 0.764 79.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.330 10.950 < 2e-16 ***
## poly(indus, 3)1 78.591 7.423 10.587 < 2e-16 ***
## poly(indus, 3)2 -24.395 7.423 -3.286 0.00109 **
## poly(indus, 3)3 -54.130 7.423 -7.292 1.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared: 0.2597, Adjusted R-squared: 0.2552
## F-statistic: 58.69 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the non-retail business variable the squared and cubed terms are statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by the river. Since this is a factor variable that is either 0 or 1, adding exponents to the model will not change anything. Thus, it does not have a non-linear association in terms of adding exponents.
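R makes this concrete: poly() refuses to build polynomial terms for a binary variable, because the degree must be less than the number of unique values. A minimal sketch with a made-up 0/1 vector:

```r
# A 0/1 variable has only two unique values,
# so a cubic polynomial basis cannot be constructed for it
x <- c(0, 1, 0, 1, 1, 0)
result <- try(poly(x, 3), silent = TRUE)
inherits(result, "try-error")  # TRUE: poly() errors out
```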
Next, look at how crime is affected by the nitrogen oxides.
crimeOnNoxCubed <- lm(crim ~ poly(nox, 3), Boston)
summary(crimeOnNoxCubed)
##
## Call:
## lm(formula = crim ~ poly(nox, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.110 -2.068 -0.255 0.739 78.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3216 11.237 < 2e-16 ***
## poly(nox, 3)1 81.3720 7.2336 11.249 < 2e-16 ***
## poly(nox, 3)2 -28.8286 7.2336 -3.985 7.74e-05 ***
## poly(nox, 3)3 -60.3619 7.2336 -8.345 6.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared: 0.297, Adjusted R-squared: 0.2928
## F-statistic: 70.69 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the nitrogen oxides variable the squared and cubed terms are statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by the number of rooms per dwelling.
crimeOnRmCubed <- lm(crim ~ poly(rm, 3), Boston)
summary(crimeOnRmCubed)
##
## Call:
## lm(formula = crim ~ poly(rm, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -3.468 -2.221 -0.015 87.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3703 9.758 < 2e-16 ***
## poly(rm, 3)1 -42.3794 8.3297 -5.088 5.13e-07 ***
## poly(rm, 3)2 26.5768 8.3297 3.191 0.00151 **
## poly(rm, 3)3 -5.5103 8.3297 -0.662 0.50858
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared: 0.06779, Adjusted R-squared: 0.06222
## F-statistic: 12.17 on 3 and 502 DF, p-value: 1.067e-07
We can see that for the number of rooms per dwelling variable the squared term is statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by the proportion of units built prior to 1940.
crimeOnAgeCubed <- lm(crim ~ poly(age, 3), Boston)
summary(crimeOnAgeCubed)
##
## Call:
## lm(formula = crim ~ poly(age, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.762 -2.673 -0.516 0.019 82.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3485 10.368 < 2e-16 ***
## poly(age, 3)1 68.1820 7.8397 8.697 < 2e-16 ***
## poly(age, 3)2 37.4845 7.8397 4.781 2.29e-06 ***
## poly(age, 3)3 21.3532 7.8397 2.724 0.00668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared: 0.1742, Adjusted R-squared: 0.1693
## F-statistic: 35.31 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the proportion of units built prior to 1940 variable the squared and cubed terms are statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by the weighted mean distance from Boston centers.
crimeOnDisCubed <- lm(crim ~ poly(dis, 3), Boston)
summary(crimeOnDisCubed)
##
## Call:
## lm(formula = crim ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.757 -2.588 0.031 1.267 76.378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3259 11.087 < 2e-16 ***
## poly(dis, 3)1 -73.3886 7.3315 -10.010 < 2e-16 ***
## poly(dis, 3)2 56.3730 7.3315 7.689 7.87e-14 ***
## poly(dis, 3)3 -42.6219 7.3315 -5.814 1.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared: 0.2778, Adjusted R-squared: 0.2735
## F-statistic: 64.37 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the weighted mean distance from Boston centers variable the squared and cubed terms are statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by the accessibility to radial highways.
crimeOnRadCubed <- lm(crim ~ poly(rad, 3), Boston)
summary(crimeOnRadCubed)
##
## Call:
## lm(formula = crim ~ poly(rad, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.381 -0.412 -0.269 0.179 76.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.2971 12.164 < 2e-16 ***
## poly(rad, 3)1 120.9074 6.6824 18.093 < 2e-16 ***
## poly(rad, 3)2 17.4923 6.6824 2.618 0.00912 **
## poly(rad, 3)3 4.6985 6.6824 0.703 0.48231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3965
## F-statistic: 111.6 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the index of accessibility to radial highways variable the squared term is statistically significant, so this model has a non-linear association. Since the squared term is not as significant as in other models, the non-linearity is milder, but still present.
Next, look at how crime is affected by the property tax rate.
crimeOnTaxCubed <- lm(crim ~ poly(tax, 3), Boston)
summary(crimeOnTaxCubed)
##
## Call:
## lm(formula = crim ~ poly(tax, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.273 -1.389 0.046 0.536 76.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3047 11.860 < 2e-16 ***
## poly(tax, 3)1 112.6458 6.8537 16.436 < 2e-16 ***
## poly(tax, 3)2 32.0873 6.8537 4.682 3.67e-06 ***
## poly(tax, 3)3 -7.9968 6.8537 -1.167 0.244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3651
## F-statistic: 97.8 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the property tax rate variable the squared term is statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by pupil-teacher ratio by town.
crimeOnPtratioCubed <- lm(crim ~ poly(ptratio, 3), Boston)
summary(crimeOnPtratioCubed)
##
## Call:
## lm(formula = crim ~ poly(ptratio, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.833 -4.146 -1.655 1.408 82.697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.361 10.008 < 2e-16 ***
## poly(ptratio, 3)1 56.045 8.122 6.901 1.57e-11 ***
## poly(ptratio, 3)2 24.775 8.122 3.050 0.00241 **
## poly(ptratio, 3)3 -22.280 8.122 -2.743 0.00630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared: 0.1138, Adjusted R-squared: 0.1085
## F-statistic: 21.48 on 3 and 502 DF, p-value: 4.171e-13
We can see that for the pupil to teacher ratio variable the squared and cubed terms are statistically significant, so this model has a non-linear association.
Next, look at how crime is affected by the proportion of African-Americans.
crimeOnBlackCubed <- lm(crim ~ poly(black, 3), Boston)
summary(crimeOnBlackCubed)
##
## Call:
## lm(formula = crim ~ poly(black, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.096 -2.343 -2.128 -1.439 86.790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3536 10.218 <2e-16 ***
## poly(black, 3)1 -74.4312 7.9546 -9.357 <2e-16 ***
## poly(black, 3)2 5.9264 7.9546 0.745 0.457
## poly(black, 3)3 -4.8346 7.9546 -0.608 0.544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.955 on 502 degrees of freedom
## Multiple R-squared: 0.1498, Adjusted R-squared: 0.1448
## F-statistic: 29.49 on 3 and 502 DF, p-value: < 2.2e-16
Since the squared and cubed terms of this model are not statistically significant, we can say that the African-American proportion variable has a linear association when used alone to predict the per capita crime rate.
Next, look at how crime is affected by the percentage of the lower status of population.
crimeOnLstatCubed <- lm(crim ~ poly(lstat, 3), Boston)
summary(crimeOnLstatCubed)
##
## Call:
## lm(formula = crim ~ poly(lstat, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.234 -2.151 -0.486 0.066 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3392 10.654 <2e-16 ***
## poly(lstat, 3)1 88.0697 7.6294 11.543 <2e-16 ***
## poly(lstat, 3)2 15.8882 7.6294 2.082 0.0378 *
## poly(lstat, 3)3 -11.5740 7.6294 -1.517 0.1299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared: 0.2179, Adjusted R-squared: 0.2133
## F-statistic: 46.63 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the percentage of the lower status of the population variable the squared term is statistically significant, so this model has a non-linear association.
Lastly, look at how crime is affected by the median value of owned homes in thousands of dollars.
crimeOnMedvCubed <- lm(crim ~ poly(medv, 3), Boston)
summary(crimeOnMedvCubed)
##
## Call:
## lm(formula = crim ~ poly(medv, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.427 -1.976 -0.437 0.439 73.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.292 12.374 < 2e-16 ***
## poly(medv, 3)1 -75.058 6.569 -11.426 < 2e-16 ***
## poly(medv, 3)2 88.086 6.569 13.409 < 2e-16 ***
## poly(medv, 3)3 -48.033 6.569 -7.312 1.05e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared: 0.4202, Adjusted R-squared: 0.4167
## F-statistic: 121.3 on 3 and 502 DF, p-value: < 2.2e-16
We can see that for the median value of owned homes in thousands of dollars variable the squared and cubed terms are statistically significant, so this model has a non-linear association.
In conclusion, all of the predictor variables show non-linear association except the variable describing whether the tract bounds the river (since it is a factor of 0 or 1) and the proportion of African-Americans by town. The other 11 predictor variables show non-linear relationships through quadratic or cubic terms.
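An alternative to reading the individual polynomial t-statistics is a nested-model F-test with anova(), which asks whether the quadratic and cubic terms jointly improve on the straight-line fit. A sketch for medv, assuming the Boston data comes from the MASS package:

```r
library(MASS)  # provides the Boston data set

linearFit <- lm(crim ~ medv, data = Boston)
cubicFit  <- lm(crim ~ poly(medv, 3), data = Boston)

# A significant F-test says the quadratic and cubic terms
# jointly improve on the linear model
anova(linearFit, cubicFit)
```

Running the same comparison for each predictor gives one overall p-value per variable for the non-linearity question, rather than three separate term-level tests.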